Learning Regular Languages over Large Alphabets
نویسندگان
چکیده
Learning regular languages is a branch of machine learning, which has been proved useful in many areas, including artificial intelligence, neural networks, data mining, verification, etc. On the other hand, interest in languages defined over large and infinite alphabets has increased in recent years. Although many theories and properties generalize well from the finite case, learning such languages is not an easy task. As the existing methods for learning regular languages depends on the size of the alphabet, a straightforward generalization in this context is not possible. In this thesis, we present a generic algorithmic scheme that can be used for learning languages defined over large or infinite alphabets, such as bounded subsets of N or R or Boolean vectors of high dimensions. We restrict ourselves to the class of languages accepted by deterministic symbolic automata that use predicates to label transitions, forming a finite partition of the alphabet for every state. Our learning algorithm, an adaptation of Angluin’s L∗, combines standard automaton learning by state characterization, with the learning of the static predicates that define the alphabet partitions. We use the online learning scheme, where two types of queries provide the necessary information about the target language. The first type, membership queries, answer whether a given word belongs or not to the target. The second, equivalence queries, check whether a conjectured automaton accepts the target language, a counter-example is provided otherwise. We study language learning over large or infinite alphabets within a general framework but our aim is to provide solutions for particular concrete instances. For this, we focus on the two main aspects of the problem. Initially, we assume that equivalence queries always provide a counter-example which is minimal in the length-lexicographic order when the conjecture automaton is incorrect. Then, we drop this “strong” equivalence oracle and replace it by a more realistic assumption, where equivalence is approximated by testing queries, which use sampling on the set of words. Such queries are not guaranteed to find counter-examples and certainly not minimal ones. In this case, we obtain the weaker notion of PAC (probably approximately correct) learnability and learn an approximation of the target language. All proposed algorithms have been implemented and their performance, as a function of automaton and alphabet size, has been empirically evaluated.
منابع مشابه
Learning Regular Languages over Large Ordered Alphabets
This work is concerned with regular languages defined over large alphabets, either infinite or just too large to be expressed enumeratively. We define a generic model where transitions are labeled by elements of a finite partition of the alphabet. We then extend Angluin’s L∗ algorithm for learning regular languages from examples for such automata. We have implemented this algorithm and we demon...
متن کاملLearning Symbolic Automata
Symbolic automata allow transitions to carry predicates over rich alphabet theories, such as linear arithmetic, and therefore extend classic automata to operate over infinite alphabets, such as the set of rational numbers. In this paper, we study the foundational problem of learning symbolic automata. We first present Λ∗, a symbolic automata extension of Angluin’s L∗ algorithm for learning regu...
متن کاملVarieties of Unranked Tree Languages
We study varieties that contain unranked tree languages over all alphabets. Trees are labeled with symbols from two alphabets, an unranked operator alphabet and an alphabet used for leaves only. Syntactic algebras of unranked tree languages are defined similarly as for ranked tree languages, and an unranked tree language is shown to be recognizable iff its syntactic algebra is regular, i.e., a ...
متن کاملRegular Expressions for Languages over Infinite Alphabets
In this paper we introduce a notion of a regular expression over infinite alphabets and show that a language is definable by an infinite alphabet regular expression if and only if it is acceptable by finite-state unification based automaton – a model of computation that is tightly related to other models of automata over infinite alphabets.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014